EasyVisa Project

Context:

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

Objective:

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a machine-learning-based solution that can help shortlist the candidates with a higher chance of visa approval. OFLC has hired your firm, EasyVisa, for data-driven solutions. As a data scientist, you have to analyze the data provided and, with the help of a classification model, identify the applicants who are most likely to be certified.

Data Description

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

Objective:

To build a model that predicts the likelihood of visa certification.

Import necessary libraries

Read the dataset

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

Check the data types of the columns for the dataset.

Summary of the dataset.
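The overview steps above (first and last rows, shape, data types, summary) can be sketched as follows. The small frame below is a synthetic stand-in, and its column names and values are assumptions made for illustration, not the actual dataset.

```python
import pandas as pd

# Synthetic stand-in for the visa data; column names and values are
# hypothetical and only illustrate the overview calls.
data = pd.DataFrame(
    {
        "education_of_employee": ["Master's", "Bachelor's", "High School",
                                  "Doctorate", "Bachelor's", "Master's"],
        "no_of_employees": [1200, 45, 300, 15000, 80, 2500],
        "yr_of_estab": [1998, 2010, 1975, 1960, 2015, 2003],
        "prevailing_wage": [85000.0, 42.5, 61000.0, 120000.0, 700.0, 93000.0],
        "unit_of_wage": ["Year", "Hour", "Year", "Year", "Week", "Year"],
        "case_status": ["Certified", "Denied", "Denied",
                        "Certified", "Certified", "Certified"],
    }
)

print(data.head())   # first 5 rows
print(data.tail())   # last 5 rows
print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # data type of each column

# Summary statistics for numeric and categorical columns alike,
# transposed so that each column of the data becomes one row.
summary = data.describe(include="all").T
print(summary)
```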

Observations - Data Overview

EDA

Univariate analysis

Observations on number of employees

Observations on business year of establishment

Observations on prevailing wage

Over 8000 applicants were denied a visa, which is close to one third of all applicants.
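A class share like this can be read from value counts. The sketch below uses a six-row synthetic series (an assumption for illustration, not the real data) to show the count and the proportion side by side.

```python
import pandas as pd

# Synthetic target column: 4 certified, 2 denied (illustrative values only).
case_status = pd.Series(["Certified"] * 4 + ["Denied"] * 2, name="case_status")

counts = case_status.value_counts()                # absolute counts per class
shares = case_status.value_counts(normalize=True)  # proportions per class

print(counts)
print(shares.round(3))
```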

Bivariate Analysis

Education vs certification

Education vs Case Status
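One way to compare a categorical feature against the case status is a row-normalized cross-tabulation, which shows the certified and denied share within each education level. The column names and rows below are hypothetical.

```python
import pandas as pd

# Hypothetical sample: education level and outcome for seven applications.
df = pd.DataFrame(
    {
        "education_of_employee": ["High School", "High School", "High School",
                                  "Master's", "Master's", "Master's", "Master's"],
        "case_status": ["Denied", "Denied", "Certified",
                        "Certified", "Certified", "Denied", "Certified"],
    }
)

# normalize="index" makes each row sum to 1, so each cell is the share of
# that outcome within the education level.
status_by_education = pd.crosstab(
    df["education_of_employee"], df["case_status"], normalize="index"
)
print(status_by_education.round(2))
```

The same call applies to the other categorical comparisons in this section by swapping the first column.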

Continent vs Case Status

Experience vs Case Status

Unit of Wage vs Case Status

Prevailing Wage vs Case Status

Observations

Data Preprocessing
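A minimal preprocessing sketch, assuming the target is stored as the strings "Certified" and "Denied" and the categorical features are plain text columns. The column names are hypothetical, and one-hot encoding is one reasonable choice rather than the only one.

```python
import pandas as pd

# Hypothetical raw rows with two categorical features and one numeric feature.
df = pd.DataFrame(
    {
        "continent": ["Asia", "Europe", "Asia", "Africa"],
        "has_job_experience": ["Y", "N", "Y", "N"],
        "prevailing_wage": [85000.0, 61000.0, 93000.0, 40000.0],
        "case_status": ["Certified", "Denied", "Certified", "Denied"],
    }
)

# Encode the target as 1 = Certified, 0 = Denied.
y = (df["case_status"] == "Certified").astype(int)

# One-hot encode the categorical features; drop_first=True removes one
# level per feature to avoid redundant columns.
X = pd.get_dummies(df.drop(columns="case_status"), drop_first=True)

print(X.columns.tolist())
print(y.tolist())
```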

Model Building - Approach

  1. Data preparation (complete)
  2. Split the data into the train and test set. (complete)
  3. Train models on the training data.
  4. Try to improve the model performance using hyperparameter tuning.
  5. Test the performance on the test data.

Split Data

The stratify argument maintains the original distribution of classes in the target variable when the data is split into train and test sets.
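The effect of stratification can be checked directly. The sketch below assumes scikit-learn's train_test_split and uses synthetic features with a 2:1 class ratio; the 70/30 split size and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 300 rows, 200 positives and 100 negatives.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = np.array([1] * 200 + [0] * 100)

# stratify=y keeps the class ratio of y in both resulting subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# The share of positives should match across the full, train, and test sets.
print(y.mean(), y_train.mean(), y_test.mean())
```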

Model evaluation criterion

A visa model can make wrong predictions as:

  1. Predicting a certification when the application should be denied.
  2. Predicting a denial when an employee should be certified (given a visa).

Which case is more important?

  1. If the model predicts a certification when the application should be denied, the employee is hurt.
  2. If the model predicts a denial when an employee should be certified (given a visa), the labor market is underserved and companies of all sizes are hurt.

Which metric to optimize?

Let's define one function that returns metric scores on the train and test sets and another that shows the confusion matrix, so that we do not have to repeat the same code while evaluating each model. This reduces code duplication and keeps the evaluation consistent across models; it does not change the computational cost.
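A minimal sketch of two such helpers, assuming scikit-learn metrics and a target encoded as 1 = certified, 0 = denied. The function names are hypothetical, and the confusion matrix is returned as a labeled table rather than a plot.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def classification_scores(y_true, y_pred):
    """Return accuracy, recall, precision, and F1 as a one-row DataFrame."""
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y_true, y_pred)],
            "Recall": [recall_score(y_true, y_pred)],
            "Precision": [precision_score(y_true, y_pred)],
            "F1": [f1_score(y_true, y_pred)],
        }
    )


def confusion_table(y_true, y_pred):
    """Return the confusion matrix with readable row and column labels."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        cm,
        index=["Actual denied", "Actual certified"],
        columns=["Predicted denied", "Predicted certified"],
    )


# Small hand-made example: 3 true positives, 1 false negative,
# 1 false positive, 1 true negative.
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1])

scores = classification_scores(y_true, y_pred)
cm_table = confusion_table(y_true, y_pred)
print(scores)
print(cm_table)
```

Calling the helpers once with the training labels and once with the test labels gives the train and test scores for a fitted model.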

Decision Tree Classifier

Hyperparameter Tuning
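Tuning a decision tree can be sketched with a cross-validated grid search. The parameter grid, the F1 scoring choice, and the synthetic data below are all assumptions for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with roughly a 1:2 class ratio.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.33, 0.67], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Hypothetical grid: limit tree depth and leaf size to curb overfitting.
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="f1",  # assumed metric; swap in the metric chosen for the project
    cv=5,
)
search.fit(X_train, y_train)

tuned_tree = search.best_estimator_  # refit on the full training set
print(search.best_params_)
print(tuned_tree.score(X_test, y_test))  # accuracy on the held-out test set
```

The same pattern applies to the other classifiers by replacing the estimator and the grid.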

Bagging Classifier

Hyperparameter Tuning

AdaBoost Classifier

Hyperparameter Tuning

Gradient Boosting Classifier

Hyperparameter Tuning

Random Forest Classifier

Hyperparameter Tuning

XGBoost Classifier

Hyperparameter Tuning

Stacking Classifier
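A stacking classifier trains several base models and feeds their cross-validated predictions to a final estimator. The sketch below assumes scikit-learn's StackingClassifier; the choice of base models, the logistic regression final estimator, and the synthetic data are assumptions rather than the configuration used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Base learners; small n_estimators keeps the example fast.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=25, random_state=1)),
    ("ada", AdaBoostClassifier(n_estimators=25, random_state=1)),
    ("gb", GradientBoostingClassifier(n_estimators=25, random_state=1)),
]

# The final estimator learns how to combine the base learners' predictions.
stack = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression(), cv=3
)
stack.fit(X_train, y_train)

print(stack.score(X_test, y_test))  # accuracy on the held-out test set
```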

Comparing all models
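A comparison table can be built by fitting each model and collecting train and test scores in one frame. The sketch below uses default scikit-learn models on synthetic data and F1 as the score; the model settings and the metric are assumptions, and XGBoost is left out to keep the example free of extra dependencies.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(n_estimators=25, random_state=1),
    "AdaBoost": AdaBoostClassifier(n_estimators=25, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=25, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=25, random_state=1),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append(
        {
            "Model": name,
            "Train F1": f1_score(y_train, model.predict(X_train)),
            "Test F1": f1_score(y_test, model.predict(X_test)),
        }
    )

# Sort by test score; a large gap between train and test suggests overfitting.
comparison = (
    pd.DataFrame(rows).set_index("Model").sort_values("Test F1", ascending=False)
)
print(comparison.round(3))
```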

Feature importance of XGBoost
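Tree ensembles expose their importances through the feature_importances_ attribute, and XGBoost's scikit-learn wrapper does the same. To stay free of extra dependencies, the sketch below substitutes scikit-learn's GradientBoostingClassifier for XGBoost and uses synthetic features with placeholder names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Stand-in for the XGBoost model; both expose feature_importances_.
model = GradientBoostingClassifier(n_estimators=50, random_state=1).fit(X, y)

# Pair each importance with its feature name and rank from most to least important.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```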

Conclusion: